Julia Piaskowski
May 14, 2025
https://some-link-to-talk.html
Randomized quantile residual was proposed in the literature [in 1996] to circumvent the…problems in traditional residuals [for generalized linear models]. However, this approach has not gained deserved awareness and attention, partly due to the lack of extensive empirical studies to investigate its performance.
Deep Dive into quantile residuals for model evaluation and the most popular implementation of them, DHARMa.
DHARMa was first released in 2016. CRAN downloads were sparse until 2019, and began steadily increasing. It averages approximately ~500 downloads per day (source: https://hadley.shinyapps.io/cran-downloads/)
It has reverse dependencies with 18 R packaged including glmmTMB, performance, easystats
DHARMa does not have a published peer reviewed manuscript paper supporing its release
\[Y_{ij} = X_i\beta + Z_ja + \epsilon_{ij} \]
\[\epsilon \sim N(0, \sigma^2 \mathbf{I_n}) \] \[a \sim N(0, \sigma_a^2) \]
\[ y_i|a_{j} \sim 𝑁(x_i \beta + a_j, \sigma^2) \]
\[ y_i \sim 𝑁(x_i \beta, \sigma^2) \]
Raw \[\epsilon_i = Y_i - \hat{Y_i} \]
(internally) Studentized:
\[\epsilon_i = \frac {Y_i - \hat{Y_i}} {\hat{s}}\]
Pearson/Scaled:
\[ \epsilon_i = \frac {Y_i - \hat{Y_i}} {sd(Y_i)} \]
Dependent variable: \(Y_{ij}\) (e.g. survival)
survival: \(p_i = Y_i/N\)
\[Y_{ij}|r_j \sim Binomial(N, \pi_{ij})\]
\[\eta_{ij} = \eta + \alpha_i + r_j\] (\(\eta\) = mean, \(\alpha_i\) = treatment, \(r_j\) = grouping variable)
\[\eta_{ij} = log(\frac{\pi_{ij}} {1 - \pi_{ij}} )\]
Different distributional assumptions and mathematical mean/variance relationships of some distributions require a different approach
For \(y_1,...,y_n\) responses:
\[y_i \sim \mathcal{P}(\mu_i, \phi)\]
CDF: \[F(y; \mu, \phi)\]
For a continuous \(F\), \(F(y_i; \mu_i, \phi)\) are uniformly distributed, and the quantile residuals are:
\[ r_{i,q} = \Phi^{-1} \left\{F(y_i; \hat{\mu_i}, \hat{\phi} )\right\}\]
Dunn KP, and GK Smyth (1996). Randomized quantile residuals. J of Comp & Graph Stats 5, 1-10.
Survival: \(y_i \sim Exp(\mu_i)\)
\[ \text{log }\mu_i = \beta_0 + \beta_1 \text{ log } x_i \]
Quantile residuals:
\[ r_{i} = \Phi^{-1} \left\{ 1-exp(y_i/\hat{\mu_i}) \right\}\]
\[r_i = \Phi^{-1}(u_i)\] \[ u_i \sim Uniform(a_i, b_j]\] \[a_i = lim_{y \rightarrow y_i} F(y; \hat{\mu_i}, \hat{\phi})\]
\[ b_i = F(y_i; \hat{\mu_i}, \hat{\phi}) \]
First submitted to CRAN in February, 2003.
Randomization is used to produce continuously distributed residuals when the response is discrete or has a discrete component. This means that the quantile residuals will vary from one realization to another for a given data set and fitted model.
(Diagnostics for HierArchical Regression Models)
\[\mathbf{Y = X\beta+Za}\] \[ \mathbf{Y|a\sim G(\mu, R)} \] linear predictor: \(\mathbf{\eta = X\beta}\)
fitted value: \(E(Y) = \mu\)
Link function: \(\eta = g(\mu)\)
For each observation in the data set:
Simulate new observations (\(n_{sim} = n_{data}\)) using fitted values as the model distributional parameters (e.g. shape, scale) as appropriate. \[ Y_i \sim G(\hat{\mu_i}, \hat{\phi}) \]
Calculate the quantile for the cumulative distribution function
\[ r_{i} = F(y_i; \hat{\mu_i}, \hat{\phi} )\]
\[ r_{ij} \sim Uniform(0,1) \]
\(r_{ij} = 0\): everything is larger
\(r_{ij} = 1\): everything is smaller
\(r_{ij} = 0.5\): right in the middle
DHARMa runs 250 simulations (per observation) by default, they recommend up to 1000
\[ (\hat{Y_i} = \lambda = 5,; \quad Y_i = 7 )\]
Create a data set with a random effect (“group”, 10 levels), a covariate (“Environment1”) and a Poisson-distributed response variable (“observedResponse”):
Analyze correctly and incorrectly:
Correctly Specified Model
Incorrectly Specified Model
lm(), `glm(), lme4, mgcv, glmmTMB spaMM, GLMMadaptive, brms, & morelme4::glmer() Marginal ModelBy default, simulations are conducted on a marginal model:
lme4::glmer() Conditional ModelUse re.form argument in glmer() to resimulate the data.
The implementation is a very recent addition and lacks clear documention (see the entire discussion).
Set the simulation conditions for a glmmTMB object. The object is altered in place, hence why it’s copied here.
(Implemented in DHARMa)
Feng et al (2018) demonstrated for many simulated scenarios that quantile residuals are a better diagnostic tool for checking model distribution, model parameters, over/under dispersion, and overall lack of fit of GLMMs than ‘standard’ residuals.
The plot of residuals versus fitted is the most reliable indicator of a properly specified model
We do not know their overall sensitivity of this time to modest model misspecifications
Results from quantiles tests for uniformity should be treated with caution - no pattern proves a model is correct; likewise, nonrandom matterns are not always a deal-breaker
Outliers identified are warnings of possible outliers and lack quantitative information
Bates DM, Maechler M, Bolker BM and S Walker (2015). “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
Brooks ME, Kristensen K, van Benthem KJ, Magnusson A, Berg CW, Nielsen A, Skaug HJ, Maechler M and BM Bolker (2017). “glmmTMB Balances Speed and Flexibility Among Packages for Zero-inflated Generalized Linear Mixed Modeling.” The R Journal, 9(2), 378-400. doi: 10.32614/RJ-2017-066.
Dunn KP, and GK Smyth (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 1-10.
Feng et al (2017). Randomized quantile residuals: an omnibus model diagnostic tool with unified reference distribution. arXiv https://doi.org/10.48550/arXiv.1708.08527
Pinheiro JC and DM Bates (2000). Mixed-Effects Models in S and S-PLUS. Springer Verlag, New York.
Stroup WW, Ptukhina M and J Garai (2024). Generalize Linear Mixed Models: Modern Concepts, Methods and Applications. CRC Press, Boca Raton.
(G)LMMs are hard - harder than you may think based on what you may have learned in your second statistics class….
\[y_i \sim binomial(n, p_i) \]
\[ \text{logit} (p_i) = \beta_0 + \beta_1x_i \] \[r_i = \Phi^{-1}(u_i)\]
\[ u_i \sim Uniform(a_i, b_j] \]
\[a_i = lim_{y \uparrow y_i} F(y; n, \hat{p_i}) =\displaystyle\sum_{i=0}^{\lfloor k-1 \rfloor} \begin{pmatrix} n \\ i \end{pmatrix} p_i(1-p)^{n-i}\]
\[b_i = F(y_i; n, \hat{p_i}) =\displaystyle\sum_{i=0}^{\lfloor k \rfloor} \begin{pmatrix} n \\ i \end{pmatrix} p_i(1-p)^{n-i}\]
This is the ‘wrong model’ (missing ‘environment’), but the plot looks okay
When plotted by the missing group: